

A Spectral Energy Distance for Parallel Speech Synthesis

Neural Information Processing Systems

Speech synthesis is an important practical generative modeling problem that has seen great progress over the last few years, with likelihood-based autoregressive neural models now outperforming traditional concatenative systems. A downside of such autoregressive models is that they require executing tens of thousands of sequential operations per second of generated audio, making them ill-suited for deployment on specialized deep learning hardware. Here, we propose a new learning method that allows us to train highly parallel models of speech, without requiring access to an analytical likelihood function. Our approach is based on a generalized energy distance between the distributions of the generated and real audio. This spectral energy distance is a proper scoring rule with respect to the distribution over magnitude-spectrograms of the generated waveform audio and offers statistical consistency guarantees. The distance can be calculated from minibatches without bias, and does not involve adversarial learning, yielding a stable and consistent method for training implicit generative models. Empirically, we achieve state-of-the-art generation quality among implicit generative models, as judged by the recently-proposed cFDSD metric. When combining our method with adversarial techniques, we also improve upon the recently-proposed GAN-TTS model in terms of Mean Opinion Score as judged by trained human evaluators.
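The generalized energy distance described above can be made concrete with a small sketch. This is an illustrative, single-scale simplification, not the paper's implementation: the function names (`mag_spectrogram`, `spectral_energy_distance`) are hypothetical, and the actual method uses multi-scale spectrogram losses and a specific choice of distance, whereas this sketch uses a plain Euclidean distance between single-scale magnitude spectrograms. The key structure it shows is the attractive terms pulling generated samples toward real audio and the repulsive term between two independent model samples, which is what makes the loss a proper scoring rule.

```python
import numpy as np

def mag_spectrogram(x, frame_len=256, hop=128):
    """Magnitude spectrogram via a simple framed FFT with a Hann window.

    Illustrative only: real systems would typically use an STFT from a
    signal-processing library at several window sizes.
    """
    win = np.hanning(frame_len)
    frames = [x[i:i + frame_len] * win
              for i in range(0, len(x) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def spectral_energy_distance(real, gen_a, gen_b):
    """Minibatch-style estimate of a generalized energy distance over
    magnitude spectrograms.

    `gen_a` and `gen_b` are two independent model samples for the same
    conditioning signal. The cross-term between them is the "repulsive"
    term: it penalizes the model for collapsing to a single output and
    is what makes the objective a proper scoring rule.
    """
    d = lambda u, v: np.linalg.norm(mag_spectrogram(u) - mag_spectrogram(v))
    attract = d(real, gen_a) + d(real, gen_b)  # pull samples toward the data
    repulse = d(gen_a, gen_b)                  # keep generated samples diverse
    return attract - repulse
```

Because both attractive and repulsive terms are plain expectations over sample pairs, a minibatch estimate of this distance is unbiased, which is the property the abstract highlights as enabling stable, non-adversarial training.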





Review for NeurIPS paper: A Spectral Energy Distance for Parallel Speech Synthesis

Neural Information Processing Systems

Additional Feedback: Comments:
- Section 2: Flow-based models are not necessarily large. The new SOTA WaveFlow is a small-footprint flow-based model for raw audio. The authors may want to reference WaveFlow and correct this inaccurate claim in the related work section.
- I usually don't take FDSD-style measures seriously, as they cannot provide meaningful comparisons across different models in general, which the authors themselves also observe.
- It would be very nice to see an ablation study with MOS scores varying three design choices: 1) with or without the repulsive term, 2) single- vs. multi-scale spectrogram loss, 3) with or without the GAN loss. This would single out and emphasize the benefit of the repulsive term under different circumstances.


Review for NeurIPS paper: A Spectral Energy Distance for Parallel Speech Synthesis

Neural Information Processing Systems

This paper proposes a strategy for parallel TTS based on a spectral energy distance. It relies on neither explicit likelihood optimization nor adversarial learning, which enables more stable and consistent training. On top of that, the authors introduce a repulsive term which is shown to significantly improve the quality of the generated speech. When combined with adversarial training, the quality of the speech can be improved further. Overall, this is an interesting work, technically solid and experimentally compelling.




A Spectral Energy Distance for Parallel Speech Synthesis

Gritsenko, Alexey A., Salimans, Tim, Berg, Rianne van den, Snoek, Jasper, Kalchbrenner, Nal

arXiv.org Machine Learning
